Exploring Places

Experiments:

    1. Visualizing the Places dataset
    2. Exploring Place Tags
    3. Exploring Towns & Place Names
    4. Exploring Properties
    5. Exploring Place Description Similarities
    6. Topic Modelling of Place Descriptions
In [1]:
import json
import pandas as pd
import plotly.express as px
import os
import plotly.graph_objects as go
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from bertopic import BERTopic
In [2]:
# data_path = "places.json"
data_path = "dataset/sample_20200501.json"
with open(data_path, "r") as f:
    data = json.load(f)
places = data["places"]
print(len(places))  # number of places in the sample
df = pd.DataFrame(places)
519

2. Visualizing the places dataframe

In [3]:
df["properties"].iloc[0]
Out[3]:
{'place.child-restrictions': True,
 'place.facilities.free-wifi': True,
 'place.facilities.dogs-allowed': False,
 'place.facilities.parking': True,
 'place.facilities.toilets': True,
 'place.facilities.toilets_disabled': False,
 'place.facilities.wheelchair-access': False,
 'place.capacity.max': '160'}
In [4]:
df.shape[0]
Out[4]:
519

Experiment 1: Exploring Place Ids

In [5]:
df_ids=df.groupby(['place_id']).size().reset_index()
df_ids=df_ids.rename(columns={0: "number_of_times"}).sort_values(by=['number_of_times'], ascending=False)
df_ids
Out[5]:
place_id number_of_times
0 1 1
325 52196 1
355 56354 1
354 56346 1
353 56343 1
... ... ...
168 18603 1
167 18590 1
166 18577 1
165 18443 1
518 128456 1

519 rows × 2 columns

Experiment 2: Exploring Tags Places

Each place stores its tags as a list. We separate the elements of each list into new rows, so that every row holds exactly one tag.

In [6]:
df["tags"][0:5]
Out[6]:
0        [Bar & pub food, Comedy, Restaurants, Venues]
1    [Cinemas, Community centre, Public buildings, ...
2    [Arts Centre, Galleries, Language School, Publ...
3                         [Conference Centres, Venues]
4                                   [Theatres, Venues]
Name: tags, dtype: object
In [7]:
df_tags=df.explode('tags')
In [8]:
df_tags
Out[8]:
address email postal_code properties sort_name town website place_id modified_ts created_ts name loc country_code tags descriptions phone_numbers status
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Bar & pub food [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Comedy [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Restaurants [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Venues [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
1 10 Orwell Terrace NaN EH11 2DY NaN St Bride's Centre Edinburgh http://stbrides.wordpress.com 371 2019-12-04T13:27:26Z 2019-12-04T13:27:26Z St Bride's Centre {'latitude': '55.94255035', 'longitude': '-3.2... GB Cinemas [{'type': 'description.list.default', 'descrip... {'info': '0131 346 1405'} live
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
518 99 Leith Walk leitharches@gmail.com EH6 8LY NaN Under the Arch Edinburgh https://www.leitharches.com/under-the-arch 128456 2020-10-23T17:13:49Z 2020-10-23T17:13:49Z Under the Arch {'latitude': '55.968951138337516', 'longitude'... GB Beer NaN NaN live
518 99 Leith Walk leitharches@gmail.com EH6 8LY NaN Under the Arch Edinburgh https://www.leitharches.com/under-the-arch 128456 2020-10-23T17:13:49Z 2020-10-23T17:13:49Z Under the Arch {'latitude': '55.968951138337516', 'longitude'... GB Food & drink NaN NaN live
518 99 Leith Walk leitharches@gmail.com EH6 8LY NaN Under the Arch Edinburgh https://www.leitharches.com/under-the-arch 128456 2020-10-23T17:13:49Z 2020-10-23T17:13:49Z Under the Arch {'latitude': '55.968951138337516', 'longitude'... GB Pop-up NaN NaN live
518 99 Leith Walk leitharches@gmail.com EH6 8LY NaN Under the Arch Edinburgh https://www.leitharches.com/under-the-arch 128456 2020-10-23T17:13:49Z 2020-10-23T17:13:49Z Under the Arch {'latitude': '55.968951138337516', 'longitude'... GB Restaurants NaN NaN live
518 99 Leith Walk leitharches@gmail.com EH6 8LY NaN Under the Arch Edinburgh https://www.leitharches.com/under-the-arch 128456 2020-10-23T17:13:49Z 2020-10-23T17:13:49Z Under the Arch {'latitude': '55.968951138337516', 'longitude'... GB Wine NaN NaN live

1362 rows × 17 columns

In [9]:
g_tags=df_tags.groupby(['tags']).size().reset_index()
g_tags=g_tags.rename(columns={0: "number_of_times"}).sort_values(by=['number_of_times'], ascending=False)
g_tags
Out[9]:
tags number_of_times
144 Outdoors 120
204 Venues 110
96 Gardens 77
154 Public buildings 74
92 Galleries 70
... ... ...
90 Fringe venue 1
91 Funk 1
97 German 1
99 Gluten free 1
220 student 1

221 rows × 2 columns
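
The explode-then-count pattern used above can be checked on a toy frame. The sketch below uses made-up rows (the names and tags are illustrative, not taken from the dataset) and `value_counts`, which should give the same counts as `groupby(...).size()`, already sorted in descending order.

```python
import pandas as pd

# Toy data: illustrative rows, not from the places dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "tags": [["Comedy", "Venues"], ["Venues"], ["Gardens", "Outdoors"]],
})

# One row per (place, tag) pair, then count how often each tag occurs
tag_counts = toy.explode("tags")["tags"].value_counts()
print(tag_counts)
```

Because `explode` repeats the other columns for every tag, counts taken after it describe tags, not places.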

In [10]:
px.histogram(g_tags, x="tags", y="number_of_times", histfunc="sum", color="tags", title='Frequency of tags places')

Experiment 3: Exploring Towns & Names

In [11]:
df["town"][1:10]
Out[11]:
1    Edinburgh
2    Edinburgh
3    Edinburgh
4    Edinburgh
5    Edinburgh
6    Edinburgh
7    Edinburgh
8    Edinburgh
9    Edinburgh
Name: town, dtype: object

3.1 Frequency of places grouped by towns

In [12]:
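df_town=df.dropna(subset=['town'])  # ignore places without a town
town=df_town.groupby(['town']).size().reset_index()
town=town.rename(columns={0: "number_of_times"})
# Drop the row labelled 0, i.e. the first town group in sorted order
town=town.drop([0])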
In [13]:
town=town.sort_values(by=['number_of_times'], ascending=False)
town
Out[13]:
town number_of_times
34 Edinburgh 280
53 Kirkcaldy 14
82 St Andrews 13
28 Dunfermline 12
20 Crail 11
... ... ...
45 Haddington 1
2 Aberlady 1
49 Inveresk 1
54 Kirkliston 1
92 Yetholm 1

92 rows × 2 columns

In [14]:
px.scatter(town, x='town', y='number_of_times', color='number_of_times',  size="number_of_times", size_max=60, title="Frequency of places grouped by towns")

3.2 Frequency of places grouped by name

In [15]:
df_name_town=df.groupby(['name']).size().reset_index()
df_name_town=df_name_town.rename(columns={0: "number_of_times"})
df_name_town=df_name_town.sort_values(by=['number_of_times'], ascending=False)
df_name_town.reset_index()
Out[15]:
index name number_of_times
0 473 Various Venues 2
1 0 &Gallery 1
2 355 St James Scottish Episcopal Church 1
3 353 St Fort Woodland Garden with Tayfield and Will... 1
4 352 St Cuthbert's Church 1
... ... ... ...
513 167 Hidden Gardens of Newburgh 1
514 166 Heriot-Watt University 1
515 165 Henry's Cellar Bar 1
516 164 Hen's Ladder 1
517 517 theSpace on the Mile 1

518 rows × 3 columns

3.3. Frequency of places grouped by name and town

In [16]:
df_name_town=df.groupby(['name', 'town']).size().reset_index()
df_name_town=df_name_town.rename(columns={0: "number_of_times"})
df_name_town=df_name_town.sort_values(by=['number_of_times'], ascending=False)
df_name_town
Out[16]:
name town number_of_times
0 &Gallery Edinburgh 1
325 Scottish National Portrait Gallery Edinburgh 1
355 St James Scottish Episcopal Church Edinburgh 1
354 St Giles' Cathedral Edinburgh 1
353 St Fort Woodland Garden with Tayfield and Will... Newport 1
... ... ... ...
168 Hill of Tarvit Mansion & Garden Cupar 1
167 Hidden Gardens of Newburgh Newburgh 1
166 Heriot-Watt University Galashiels 1
165 Henry's Cellar Bar Edinburgh 1
518 theSpace on the Mile Edinburgh 1

519 rows × 3 columns

Experiment 4: Exploring Properties

In [17]:
df_properties=pd.concat([df.drop(['properties'], axis=1), df['properties'].apply(pd.Series)], axis=1)
In [18]:
df_properties[0:3]
Out[18]:
address email postal_code sort_name town website place_id modified_ts created_ts name ... place.child-restrictions place.facilities.dogs-allowed place.facilities.free-wifi place.facilities.guide-dogs place.facilities.hearing-loop place.facilities.parking place.facilities.toilets place.facilities.toilets_baby-changing place.facilities.toilets_disabled place.facilities.wheelchair-access
0 5 York Place admin@thestand.co.uk EH1 3EB Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand ... True False True NaN NaN True True NaN False False
1 10 Orwell Terrace NaN EH11 2DY St Bride's Centre Edinburgh http://stbrides.wordpress.com 371 2019-12-04T13:27:26Z 2019-12-04T13:27:26Z St Bride's Centre ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 West Parliament Square ifecosse.edimbourg-cslt@diplomatie.gouv.fr EH1 1RN Institut Français d'Ecosse Edinburgh http://www.ifecosse.org.uk 372 2021-02-23T16:57:44Z 2021-02-23T16:57:44Z Institut Français d'Ecosse ... NaN NaN False NaN NaN False False NaN False True

3 rows × 29 columns
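
`apply(pd.Series)` works but can be slow on larger frames. A possible alternative, sketched here on toy data, builds the property columns directly from the list of dictionaries. The toy rows, the handling of missing `properties` as empty dictionaries, and the numeric conversion of `place.capacity.max` (stored as a string in the sample shown earlier) are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data: the second place has no properties at all
toy = pd.DataFrame({
    "name": ["A", "B"],
    "properties": [
        {"place.facilities.parking": True, "place.capacity.max": "160"},
        None,
    ],
})

# Replace missing entries with empty dicts so every row is a mapping
props = [p if isinstance(p, dict) else {} for p in toy["properties"]]

# Build one column per property key, aligned on the original index
flat = pd.DataFrame(props, index=toy.index)
out = pd.concat([toy.drop(columns="properties"), flat], axis=1)

# Capacity arrives as a string; convert it so it can be aggregated
out["place.capacity.max"] = pd.to_numeric(out["place.capacity.max"], errors="coerce")
print(out)
```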

4.1 Frequency of places grouped by wheelchair-access and town

In [19]:
df_properties_wc=df_properties.groupby(['place.facilities.wheelchair-access', 'town']).size().reset_index()
df_properties_wc=df_properties_wc.rename(columns={0: "number_of_times"})
df_properties_wc=df_properties_wc.sort_values(by=['number_of_times'], ascending=False)
df_properties_wc
Out[19]:
place.facilities.wheelchair-access town number_of_times
16 True Edinburgh 57
2 False Edinburgh 23
14 True Dunfermline 4
27 True Musselburgh 3
23 True Livingston 2
29 True St Andrews 2
20 True Hawick 2
22 True Kirkcaldy 2
19 True Glenrothes 1
21 True Kelso 1
25 True Lochgelly 1
24 True Loanhead 1
17 True Falkland 1
26 True Melrose 1
28 True North Berwick 1
30 True St Monans 1
18 True Galashiels 1
0 False Bathgate 1
1 False Dalkeith 1
15 True East Linton 1
13 True Dirleton 1
12 True Cupar 1
11 True Coldstream 1
10 True Cockenzie 1
9 True Anstruther 1
8 True Aberlady 1
7 False Wilkieston 1
6 False South Queensferry 1
5 False Peebles 1
4 False Hawick 1
3 False Galashiels 1
31 True Tranent 1

4.2 Frequency of places grouped by toilets_disabled and town

In [20]:
df_properties_td=df_properties.groupby(['place.facilities.toilets_disabled', 'town']).size().reset_index()
df_properties_td=df_properties_td.rename(columns={0: "number_of_times"})
df_properties_td=df_properties_td.sort_values(by=['number_of_times'], ascending=False)
df_properties_td
Out[20]:
place.facilities.toilets_disabled town number_of_times
14 True Edinburgh 53
5 False Edinburgh 24
13 True Dunfermline 3
18 True Hawick 3
16 True Galashiels 2
21 True Livingston 2
27 True St Andrews 2
25 True Musselburgh 2
20 True Kirkcaldy 2
19 True Kelso 1
0 False Aberlady 1
22 True Loanhead 1
23 True Lochgelly 1
26 True North Berwick 1
28 True St Monans 1
29 True Tranent 1
24 True Melrose 1
15 True Falkland 1
17 True Glenrothes 1
1 False Bathgate 1
12 True Dalkeith 1
11 True Coldstream 1
10 True Cockenzie 1
9 True Anstruther 1
8 False South Queensferry 1
7 False Peebles 1
6 False Musselburgh 1
4 False East Linton 1
3 False Dunfermline 1
2 False Cupar 1
30 True Wilkieston 1
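
The long True/False listings above are hard to compare town by town. One way to read them side by side is a cross-tabulation, sketched below on toy data. The rows, the `access` helper column, and the `share_yes` column are hypothetical names introduced for this example.

```python
import pandas as pd

# Toy data: None stands for places that do not report the property
toy = pd.DataFrame({
    "town": ["Edinburgh", "Edinburgh", "Edinburgh", "Hawick", "Hawick", "Kelso"],
    "wheelchair": [True, True, False, True, False, None],
})

# Use string labels so the resulting columns are easy to address
toy["access"] = toy["wheelchair"].map({True: "yes", False: "no"})

# Rows with a missing value are dropped by crosstab's default settings
table = pd.crosstab(toy["town"], toy["access"])
table["share_yes"] = table["yes"] / (table["yes"] + table["no"])
print(table)
```

Towns whose places never report the property (Kelso in the toy data) do not appear in the table, which mirrors how `groupby` skips missing keys in the cells above.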

5. Exploring Descriptions

In [21]:
df_descriptions=df.explode('descriptions')
df_descriptions=pd.concat([df_descriptions.drop(['descriptions'], axis=1), df_descriptions['descriptions'].apply(pd.Series)], axis=1)
df_descriptions=df_descriptions.dropna(subset=['description']).reset_index()
documents=df_descriptions["description"].values
In [22]:
len(documents)
Out[22]:
215
In [23]:
import re 
from gensim.parsing.preprocessing import remove_stopwords
def clean_documents(text):
    text = re.sub(r'\S*@\S*\s?', '', text, flags=re.MULTILINE) # remove email
    text = re.sub(r'http\S+', '', text, flags=re.MULTILINE) # remove web addresses
    text = re.sub("\'", "", text) # remove single quotes
    text = remove_stopwords(text)
    return text
In [24]:
# Cleaned descriptions, in the same order as `documents`
d = [clean_documents(text) for text in documents]
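
To see what the regular expressions do, here is a standalone sketch that needs only the standard library. It leaves out gensim's stop-word removal. It also adds two things that are not in the original function: it strips curly apostrophes (the descriptions contain forms such as "King’s") and it collapses repeated whitespace. The sample sentence and the name `clean_text` are made up.

```python
import re

def clean_text(text):
    text = re.sub(r"\S*@\S*\s?", "", text)  # remove email addresses
    text = re.sub(r"http\S+", "", text)     # remove web addresses
    text = re.sub(r"['’]", "", text)        # remove straight and curly apostrophes
    return " ".join(text.split())           # collapse repeated whitespace

sample = "Mary King’s Close: email info@example.org or see http://example.org today. It's open."
cleaned = clean_text(sample)
print(cleaned)
```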

Generating Text Embeddings

In [25]:
model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode the cleaned descriptions with the pretrained all-MiniLM-L6-v2 model (no training happens here)
text_embeddings = model.encode(d, batch_size = 8, show_progress_bar = True)

In [26]:
np.shape(text_embeddings)
Out[26]:
(215, 384)

Description Similarity

In [27]:
similarities = cosine_similarity(text_embeddings)
# argsort is ascending: the last index in each row is normally the document
# itself (similarity 1.0), so the second-to-last is its nearest neighbour
similarities_sorted = similarities.argsort()
id_1 = []
id_2 = []
score = []
for index,array in enumerate(similarities_sorted):
    id_1.append(index)
    id_2.append(array[-2])
    score.append(similarities[index][array[-2]])
index_df = pd.DataFrame({'id_1' : id_1,
                          'id_2' : id_2,
                          'score' : score})
print(index_df)
     id_1  id_2     score
0       0   200  0.387467
1       1    45  0.624832
2       2    12  0.488309
3       3    78  0.550572
4       4    78  0.648187
..    ...   ...       ...
210   210    24  0.544140
211   211   133  0.365551
212   212   213  0.559567
213   213   212  0.559567
214   214    94  0.543185

[215 rows x 3 columns]
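
Taking the second-to-last entry of `argsort` assumes that each document ranks itself first. When two descriptions are identical, that tie can be broken either way. A more defensive variant, sketched below with NumPy only, masks the diagonal before taking the maximum. The function name and the toy vectors are illustrative.

```python
import numpy as np

def nearest_neighbours(embeddings):
    """Return, for each row, the index and cosine score of its closest other row."""
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalise rows
    sims = unit @ unit.T                                     # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)                          # exclude self-matches
    idx = sims.argmax(axis=1)
    return idx, sims[np.arange(len(emb)), idx]

toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
idx, scores = nearest_neighbours(toy)
print(idx, scores)
```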
In [28]:
index_df["score"].sort_values(ascending=False)
Out[28]:
119    0.988307
120    0.988307
51     0.889471
52     0.889471
65     0.844464
         ...   
140    0.388628
182    0.388069
0      0.387467
211    0.365551
202    0.358468
Name: score, Length: 215, dtype: float32
In [29]:
index_df.iloc[85]
Out[29]:
id_1     85.000000
id_2     86.000000
score     0.827258
Name: 85, dtype: float64

NOTE: Documents 119 and 120 seem to be the most similar pair (score 0.988). Let's look at their text.

In [36]:
documents[119]
Out[36]:
"The Real Mary King’s Close is one of Scotland’s most unique historic sites. It took its name from one Mary King, a merchant burgess who resided on the close in the 17th century. Due to the building of the Royal Exchange in the 18th century, the close was partially demolished and buried, and was later closed to the public for many years. The area became shrouded in myths and urban legends with many tales of hauntings and murders. The Real Mary King's Close now operates as a tourist attraction with guided tours. \n\nBeneath the famous Royal Mile, discover the hidden streets, homes and passageways where citizens of Edinburgh lived, worked and died in the 16th and 17th centuries. The Real Mary King's Close is Edinburgh’s only preserved 17th century street, featuring a labyrinth of Old Town alleyways. Tours in these subterranean chambers are led by guides in the character of real people who lived in the close. There are also souvenir shops and a courtyard café. Tours last one hour and are fully guided."
In [37]:
documents[120]
Out[37]:
"The Real Mary King’s Close is one of Scotland’s most unique historic sites. It took its name from one Mary King, a merchant burgess who resided on the close in the 17th century. Due to the building of the Royal Exchange in the 18th century, the close was partially demolished and buried, and was later closed to the public for many years. The Real Mary King's Close now operates as a tourist attraction with guided tours.\n\nBeneath the famous Royal Mile, discover the hidden streets, homes and passageways where citizens of Edinburgh lived, worked and died in the 16th and 17th centuries. The Real Mary King's Close is Edinburgh’s only preserved 17th century street, featuring a labyrinth of Old Town alleyways. Tours in these subterranean chambers are led by guides in the character of real people who lived in the close. There are also souvenir shops and a courtyard café. Tours last one hour and are fully guided."

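Read side by side, the two descriptions differ by one sentence: document 119 contains "The area became shrouded in myths and urban legends with many tales of hauntings and murders", which document 120 lacks. For longer near-duplicates a small helper can surface such differences. The sketch below uses a deliberately naive sentence splitter and shortened toy strings. The function names are hypothetical.

```python
def sentences(text):
    # Naive splitter: enough for a quick comparison, not for general NLP
    flat = " ".join(text.split())
    return [s.strip() for s in flat.split(". ") if s.strip()]

def only_in_first(a, b):
    """Sentences that appear in `a` but not in `b`, in their original order."""
    other = set(sentences(b))
    return [s for s in sentences(a) if s not in other]

doc_a = "The close was buried. The area became shrouded in myths. Tours last one hour."
doc_b = "The close was buried. Tours last one hour."
extra = only_in_first(doc_a, doc_b)
print(extra)
```
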
6. Topic Modelling

In [32]:
topic_model = BERTopic(min_topic_size=10).fit(d, text_embeddings)
topics, probs = topic_model.transform(d, text_embeddings)
topic_model.visualize_topics()
In [33]:
topic_model.visualize_barchart()
In [34]:
topic_model.visualize_heatmap()
In [35]:
topic_model.get_topic_freq()
Out[35]:
Topic Count
0 -1 90
1 0 46
2 1 24
3 2 23
4 3 19
5 4 13
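
In BERTopic, topic -1 collects the documents that were not assigned to any topic (outliers). The counts above can be turned into shares with a few lines. The numbers are copied from the table, and the variable names are illustrative.

```python
# Topic counts copied from the get_topic_freq() output above
topic_counts = {-1: 90, 0: 46, 1: 24, 2: 23, 3: 19, 4: 13}

total = sum(topic_counts.values())
outlier_share = topic_counts[-1] / total

print(f"{total} documents, {outlier_share:.1%} left unassigned (topic -1)")
for topic, count in topic_counts.items():
    if topic != -1:
        print(f"topic {topic}: {count / total:.1%}")
```

With roughly two in five descriptions left unassigned, a smaller `min_topic_size` may be worth trying. Whether that yields more useful topics on this dataset has not been tested here.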